ABSTRACT

There  is  a  style  of  language  characteristic  of  different  subject  areas 
which  is  particularly  noticeable  in  scientific  and  technical  writing. 

It  is  not  only  the  unique  vocabulary  of  a  subject  field  which  sets  it 
apart  from  others,  but  also  the  different  habits  of  writers  in  using 
the  most  common  words.  An  experiment  was  devised  to  test  whether 
these  differences  could  be  used  for  subject  discrimination  in  addition 
to  identification  of  unique  vocabulary,  particularly  to  determine 
whether  or  not  author  variation  in  style  is  sufficiently  great  to 
override  the  variation  from  field  to  field. 

Fifty  IRE  abstracts  in  the  field  of  electronic  computers  and  fifty 
Psychological  Abstracts  were  matched,  one  abstract  at  a  time,  one  word 
type  at  a  time,  against  two  lists  of  words  ranked  in  descending  order 
of frequency as they occurred within two different sets of three
hundred psychological and computer abstracts. All fully inflected
forms  of  all  function  and  content  words  were  included  in  the  rankings. 

Using the first 50 ranks only of the two lists, 93% of the abstracts
were successfully discriminated. For the first 75 and 100 ranks, the
success rates were 96% and 97%, respectively.

RANK ORDER PATTERNS OF COMMON WORDS AS DISCRIMINATORS OF SUBJECT
CONTENT  IN  SCIENTIFIC  AND  TECHNICAL  PROSE 


Introduction 

There  is  little  reason  to  be  satisfied  with  current  information  system  designs 
either  for  dissemination  or  retrieval.  The  use  of  condensed  representations 
in the form of class categories or index terms has limitations. Systems
using such devices appear, inherently, to produce a great deal of "noise,"
as can be seen in the recent work on relevance/recall ratios. Whole text
or "natural language" processing approaches appear to offer the greatest
promise of improvement in retrieval systems. The designers of prose
processing schemes, however, have encountered serious difficulties in building
systems which are both practical and economical.

A  major  problem  in  working  with  natural  language  is  the  range  of  variation 
in  linguistic  behavior.  The  wide  range  of  variation  has  been  an  obstacle 
to  successful  predictive  generalization,  whether  applied  to  mechanical  or 
human information storage and retrieval. One reason for the current
difficulties is that we do not have a sufficiently precise knowledge of the
stochastic parameters of language, particularly as it is used in different
subjects and contexts. A second reason is that efforts directed at
statistical techniques of linguistic analysis have concentrated upon the relatively
infrequent  verbal  constructs. 

It has been a common practice in building language processing programs to
reduce  the  number  of  different  entities  which  must  be  handled  by  excluding 
the  most  common  articles,  prepositions,  conjunctions  and  auxiliary  verb 
forms,  and  by  combining  inflected  forms  of  common  roots.  Such  procedures 
do  result  in  the  loss  of  a  certain  amount  of  information.  Through  reading 
the  reports  of  G.  Yule  and  G.  Herdan  and  of  F.  Mosteller  and  D.  Wallace 
in  establishing  the  authorship  of  disputed  works,  I  was  led  to  consider  ways 
in  which  this  lost  information  could  be  recovered  and  used  to  supplement 
established  methods.  G.  K.  Zipf  had  already  shown  one  way  of  using  rank 
order distributions of words. Others have indicated that there is a
considerable  range  of  variation  in  the  way  individual  authors  use  the  most 
commonly  occurring  words  in  a  language  in  different  contexts. 

There  is  a  style  of  language  characteristic  of  different  subject  areas 
which  is  particularly  noticeable  in  scientific  and  technical  writing.  It 
is  not  only  the  unique  vocabulary  of  a  subject  field  which  sets  it  apart 
from others, but also the different habits of writers in different fields
in using common prepositions, nouns, and verbs. This is most clearly
illustrated in mathematical writing, in which symbology is embedded in a highly
stylized  form  of  prose,  sufficiently  unlike  ordinary  language  to  be  considered 
a  distinct  dialect.  The  growth  of  "dialects"  in  this  sense  is  common  to  all 
subjects  in  varying  degrees.  The  question  is  whether  these  behavioral 
differences  are  sufficiently  distinctive  to  provide  a  basis  for  subject 
discrimination  in  addition  to  the  identification  of  unique  vocabulary. 

One of the first considerations in estimating whether a practical
discriminator could be built was whether or not author variation in style is
sufficiently great to override the variation from field to field. An experiment
was  devised  to  test  this  proposition  and  to  gather  evidence  for  identification 
of  statistical  parameters  and  techniques  useful  for  subject  discrimination. 

The  Experiment 

An experimental corpus was selected consisting of 350 Psychological Abstracts
and 350 IRE abstracts from the Transactions of the Professional Group on
Electronic Computers (PGEC). The abstracts were available at System
Development Corporation in machine-readable form.* This corpus was
considered to provide an adequate reflection of author variation, in that the
abstracts had largely been written by different persons, including authors
of the papers abstracted.

Three hundred psychological abstracts and three hundred PGEC abstracts were
taken from the corpus for establishment of population "profiles" of the two
subject  areas.  The  profiles  consisted  of  two  lists  of  the  most  frequent  100 
words  ranked  in  descending  order  of  occurrence  within  the  two  sets  of  300 
abstracts.  A  System  Development  Corporation  computer  program  called  FEAT 
was  used  to  provide  the  counts  and  listings.  The  Appendix  presents  a 
consolidated  alphabetic  list  of  the  words  in  the  two  profiles,  together 
with  their  rank  numbers. 


* The abstracts were drawn from the experimental sets used originally by
Borko for automatic classification and by Maron for automatic indexing.

Where occurrence frequencies of two or more words were equal, a word length
criterion  was  applied  such  that  the  shorter  word  was  given  the  higher  rank. 
This  was  based  on  the  assumption  that,  in  general,  short  words  are  more 
prevalent  than  long.  When  word  length  as  well  as  frequency  were  equal, 
the  words  were  ranked  in  alphabetic  order. 
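
The ranking rule can be sketched in a few lines of Python. This is a minimal illustration and not the FEAT program, whose internals are not described here; the function name build_profile and the tokenizing pattern (runs of letters, lowercased) are assumptions made for the example.

```python
import re
from collections import Counter


def build_profile(abstracts, size=100):
    """Return {word: rank} for the `size` most frequent word types.

    Assumed tokenization: lowercase runs of letters. Fully inflected
    forms are kept as distinct types (no stemming, no exclusion list).
    """
    counts = Counter()
    for text in abstracts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    # Descending frequency; ties go to the shorter word, then alphabetic order.
    ordered = sorted(counts, key=lambda w: (-counts[w], len(w), w))
    return {word: rank for rank, word in enumerate(ordered[:size], start=1)}


if __name__ == "__main__":
    sample = ["the cat and the dog", "a dog and a bird"]
    for word, rank in build_profile(sample).items():
        print(rank, word)
```

Encoding the tie-breaks in a single sort key keeps the ordering deterministic, so the same corpus always yields the same rank numbers.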

A  version  of  the  FEAT  program  was  used  to  count  and  list  the  words  in  each 
of the 100 abstracts remaining in the experimental corpus of 700. Each
abstract  was  matched,  one  word  type  at  a  time,  against  the  two  profiles  of 
100  rank-ordered  words.  The  words  in  each  abstract  occurring  in  one  or 
both  of  the  two  profiles  were  recorded,  together  with  their  rank  numbers. 

The purpose of this procedure was to segregate the abstracts into two files,
psychological and PGEC abstracts, respectively. After considering a number
of  decision  rules,  the  following  criteria  were  adopted: 

1.  An  abstract  belongs  to  psychology  if  the  number  of  words  in  common 
with  the  psychology  profile  is  greater  than  the  number  in  common 
with  the  PGEC  profile,  and  conversely. 

2.  If  the  number  of  words  in  common  in  the  abstract  and  the  two 
profiles  were  equal,  the  sum  of  the  rank  numbers  of  those  words  on 
the  two  lists  would  be  determined,  and  the  abstract  assigned  to 
the  profile  with  the  smaller  sum.  If  the  sums  were  equal,  no 
decision  would  be  made. 

Figures 1 and 2 illustrate the data recorded and the results of matching
two  abstracts  against  the  first  50,  75,  and  the  full  100  ranks  of  the  two 
profiles.  In  both  cases  the  number  of  words  in  the  abstracts  contained  in 
the  first  50  ranks  of  the  two  profiles  is  the  same.  Summing  the  rank  numbers 
permits  both  abstracts  to  be  correctly  discriminated  by  the  rule  given. 
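
The two decision rules can be written out as a short Python sketch. The function name classify and the dictionary representation of a profile (word mapped to rank) are assumptions made for illustration; the rank values in the usage example are those recorded for Psychological Abstract #1 in Figure 1.

```python
def classify(abstract_words, psych_profile, pgec_profile, max_rank=100):
    """Apply the two decision rules; return 'psych', 'pgec', or None."""
    words = set(abstract_words)
    psych = {w: r for w, r in psych_profile.items() if r <= max_rank and w in words}
    pgec = {w: r for w, r in pgec_profile.items() if r <= max_rank and w in words}
    # Rule 1: the profile sharing more word types with the abstract wins.
    if len(psych) != len(pgec):
        return "psych" if len(psych) > len(pgec) else "pgec"
    # Rule 2: on equal counts, the smaller sum of rank numbers wins.
    psych_sum, pgec_sum = sum(psych.values()), sum(pgec.values())
    if psych_sum != pgec_sum:
        return "psych" if psych_sum < pgec_sum else "pgec"
    return None  # equal sums: no decision is made


# Profile fragments limited to the words of Figure 1.
FIG1_PSYCH = {"a": 6, "and": 3, "be": 17, "but": 63, "by": 14, "first": 74,
              "have": 56, "in": 4, "is": 7, "of": 2, "on": 13, "the": 1,
              "to": 5, "were": 18, "with": 9}
FIG1_PGEC = {"a": 3, "and": 4, "be": 13, "by": 14, "have": 69,
             "information": 40, "in": 7, "is": 5, "of": 2, "on": 16,
             "the": 1, "to": 6, "with": 17}
FIG1_WORDS = sorted(set(FIG1_PSYCH) | set(FIG1_PGEC))

if __name__ == "__main__":
    for level in (50, 75, 100):
        print(level, classify(FIG1_WORDS, FIG1_PSYCH, FIG1_PGEC, max_rank=level))
```

At the 50 rank level both profiles share 12 words with the abstract, so the decision falls to the rank sums (99 against 128); at 75 and 100 ranks the counts alone (15 against 13) decide.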

The following table summarizes the results of matching the psychological
and  PGEC  abstracts  against  the  first  50,  75,  and  100  ranks  of  the  profiles: 

                               Number Correctly Discriminated for
                               50 Ranks    75 Ranks    100 Ranks

50 Psychological Abstracts        43          46          47
50 IRE PGEC Abstracts             50          50          50
Success Ratio                     93%         96%         97%

All  of  the  abstracts  which  were  cast  into  the  "wrong"  category  by  this 
procedure were psychological abstracts. Examination of the abstracts
contributing to the profiles suggests several reasons for this. The PGEC
abstracts  represent  a  more  specialized  subject  matter  than  those  from 
Psychological  Abstracts.  In  general,  the  PGEC  abstracts  contain  fewer 
word  types  used  more  frequently.  Consequently  the  counts  contributing  to 
the  PGEC  profile  are  higher  than  those  of  psychology. 

PSYCHOLOGICAL ABSTRACT #1 - 54 word types

                        Psych. Profile        PGEC Profile
Word in Abstract        50R   75R  100R      50R   75R  100R

a                         6                    3
and                       3                    4
be                       17                   13
but                       -    63              -     -     -
by                       14                   14
first                     -    74              -     -     -
have                      -    56              -    69
information               -     -     -       40
in                        4                    7
is                        7                    5
of                        2                    2
on                       13                   16
the                       1                    1
to                        5                    6
were                     18                    -     -     -
with                      9                   17

No. words in common      12    15    15       12    13    13
Rank no. sum             99                  128

Figure 1

IRE PGEC ABSTRACT #1 - 15 word types

                        Psych. Profile        PGEC Profile
Word in Abstract        50R   75R  100R      50R   75R  100R

are                       8                    9
automatic                 -     -     -        -     -    80
be                       17                   13
considered                -     -     -        -     -    85
data                      -     -    80       37
may                      50                    -     -    91
of                        2                    2
or                       21                   27
that                     12                   19

No. words in common       6     6     7        6     6     9
Rank no. sum            110   110            107   107

Figure 2

In examining the results it was found that, at the 100 rank level, 88% of
the successfully discriminated abstracts were dependent on the 52 words
that are unique to each profile, with 9% successfully decided through summing
the rank numbers. It was considered useful to investigate the discrimination
to be obtained by the rank sum criterion alone, using only words common to
the profiles.

There are 48 words in common on the profiles in the first 100 ranks.
Figure 3 lists the words in common and their ranks. The mean difference of
rank for these words is 17.4, with the lower ranks tending to larger
differences than the higher ranks. As can be seen from the figure, function words
predominate. The following table shows the results of matching the 100
abstracts against the list of 48 words common to the profiles and applying
the rank sum criterion:


                               Correct    Incorrect

50 Psychological Abstracts        36          14
50 IRE PGEC Abstracts             42           8
Percentage                        78%         22%
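
As a cross-check on the figures above, the rank sum criterion restricted to the common words can be sketched in Python. The ranks below are transcribed from the Appendix; the names COMMON_RANKS, mean_rank_difference, and classify_by_rank_sum are assumptions made for this illustration, and taking the difference of rank as an absolute value is likewise an assumption.

```python
# (Psych. rank, PGEC rank) for the words present in both 100-rank profiles.
COMMON_RANKS = {
    "a": (6, 3), "an": (16, 12), "and": (3, 4), "are": (8, 9),
    "as": (11, 20), "at": (39, 43), "be": (17, 13), "by": (14, 14),
    "can": (100, 23), "data": (80, 37), "discussed": (34, 21),
    "each": (84, 54), "for": (10, 8), "function": (96, 68),
    "general": (69, 94), "has": (64, 58), "have": (56, 69), "in": (4, 7),
    "into": (91, 64), "is": (7, 5), "it": (23, 24), "its": (49, 90),
    "may": (50, 91), "method": (44, 22), "methods": (72, 63),
    "more": (30, 66), "new": (90, 53), "number": (94, 50), "of": (2, 2),
    "on": (13, 16), "one": (65, 41), "or": (21, 27), "other": (43, 70),
    "presented": (58, 59), "problems": (51, 88), "some": (45, 67),
    "such": (71, 48), "system": (82, 18), "than": (29, 49),
    "that": (12, 19), "the": (1, 1), "these": (33, 44), "this": (19, 25),
    "time": (46, 62), "to": (5, 6), "two": (36, 61), "which": (20, 11),
    "with": (9, 17),
}


def mean_rank_difference(common):
    """Mean absolute difference between the two ranks of each common word."""
    return sum(abs(p - g) for p, g in common.values()) / len(common)


def classify_by_rank_sum(abstract_words, common):
    """Assign an abstract to the profile with the smaller rank sum, or None."""
    shared = set(abstract_words) & common.keys()
    psych_sum = sum(common[w][0] for w in shared)
    pgec_sum = sum(common[w][1] for w in shared)
    if psych_sum == pgec_sum:
        return None
    return "psych" if psych_sum < pgec_sum else "pgec"


if __name__ == "__main__":
    print(len(COMMON_RANKS), round(mean_rank_difference(COMMON_RANKS), 1))
```

Because every shared word appears on both lists, the word counts are always equal under this variant, and the whole decision rests on the rank sums.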

Conclusions 

The results of this experiment indicate that author variation in style
imposes no serious obstacle to using patterns of common words as
discriminators. Considering the length of the profiles, the small size of the
sample  contributing  to  the  profiles,  and  the  limited  number  of  word  types 
contained  in  individual  abstracts,  the  success  ratios  are  surprisingly 
high.  It  is  uncertain,  however,  to  what  degree  the  results  are  biased  by 
editorial  conventions  and  style. 

The results also tend to support the idea that there is much useful
information to be found in the high frequency area of word occurrence, and that
frequency alone can provide a basis for subject discrimination of widely
different fields, particularly when all word type occurrences of fully
inflected forms are taken into account. Further work is required to
establish  the  precision  which  may  be  expected  of  such  a  technique,  especially 
if  applied  to  fields  more  closely  related  than  psychology  and  computers. 

Potential  Applications 

A  system  designed  to  make  use  of  common  word  patterns  through  a  technique 
similar  to  that  described  in  this  paper  would  include  a  short  table  intended 
to  combine  the  functions  of  an  exclusion  list  with  identification  of  broad 
subject  areas.  Such  a  quick  initial  segregation  would  reduce  the  search 
time  required  for  matching  against  the  particular  vocabulary  of  those  areas. 
Figure  4  illustrates  the  contrast  between  using  a  large  dictionary  with  the 
familiar features of exclusion lists, root stripping and an extended search
of a long table and the approach suggested here. The initial segregation
would lead directly to a relatively short specialized dictionary or to a
mismatch monitor. The thesaurus devices necessary to a large dictionary
could be simplified, and the range of ambiguity inherent to terms used in
many  different  fields  would  be  narrowed.  It  is  quite  feasible  to  use 
specialized  tables  now,  provided  the  texts  are  segregated  by  subject  prior 
to  input.  This  approach,  however,  looks  forward  to  the  application  of  optical 
readers  for  the  transformation  of  printed  text  to  machine  readable  form  in 
systems that do not require the intervention of a human mind for prior
subject classification.
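
The two-stage flow described above might look roughly like the following Python sketch. Everything named here (segregate, process, the toy profiles and dictionaries) is hypothetical and chosen only for illustration; for brevity the first stage uses the common-word count alone and omits the rank sum tie-break.

```python
def segregate(words, profiles):
    """Return the subject whose profile shares the most word types, or None on a tie.

    `profiles` maps a subject name to its set of profile words and is
    assumed to be non-empty.
    """
    word_set = set(words)
    scores = {subject: len(word_set & set(profile))
              for subject, profile in profiles.items()}
    best = max(scores.values())
    winners = [subject for subject, n in scores.items() if n == best]
    return winners[0] if len(winners) == 1 else None


def process(words, profiles, dictionaries):
    """Route a text to one short specialized dictionary, or to the mismatch monitor."""
    subject = segregate(words, profiles)
    if subject is None:
        return "mismatch", []
    vocabulary = dictionaries[subject]
    # Only the selected subject's dictionary is searched.
    return subject, [w for w in words if w in vocabulary]


if __name__ == "__main__":
    profiles = {"psych": {"was", "were", "study"},
                "pgec": {"computer", "described", "digital"}}
    dictionaries = {"psych": {"reinforcement"}, "pgec": {"adder"}}
    print(process(["a", "digital", "adder", "is", "described"], profiles, dictionaries))
```

The point of the design is that the short profile table is consulted once per text, after which only one small dictionary needs to be searched.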

Figure 4

Schematic Flow Contrasting a Conventional Technique
with Suggested Approach Using Common Word Patterns

APPENDIX
The  Profiles 


The 300 Psychological Abstracts used to build the rank-ordered profiles for
this experiment contained a total of 22,175 word occurrences of 4,587 word
types. The 300 IRE PGEC abstracts contained 23,200 word occurrences of
3,678 word types. The mean number of word occurrences per abstract was 77.3
for PGEC versus 73.9 for Psychology. When broken into subsets, both samples
exhibited  a  broad  internal  range  of  variation  for  the  expectation  that  a 
given  word  would  appear  at  a  given  rank,  with  the  broader  range  appearing 
in  the  Psychological  Abstract  set. 
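
The per-abstract means follow directly from the totals above, and the same totals show the PGEC set reusing a smaller vocabulary more heavily. A short Python check, in which the function name summarize and the occurrences-per-type ratio are illustrative additions rather than statistics reported in the paper:

```python
def summarize(occurrences, types, n_abstracts=300):
    """Return mean occurrences per abstract and mean occurrences per word type."""
    return {
        "per_abstract": occurrences / n_abstracts,
        "per_type": occurrences / types,
    }


PSYCH = summarize(22_175, 4_587)
PGEC = summarize(23_200, 3_678)

if __name__ == "__main__":
    for name, stats in (("Psych.", PSYCH), ("PGEC", PGEC)):
        print(name, round(stats["per_abstract"], 1), round(stats["per_type"], 2))
```

The higher occurrences-per-type figure for PGEC is consistent with the earlier observation that the PGEC abstracts contain fewer word types used more frequently.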

The  following  table  presents  a  consolidated  alphabetic  list  of  words  occurring 
in  the  first  100  ranks  of  the  IRE  PGEC  and  Psychological  Abstract  Profiles, 
together with their rank numbers. A dash (--) is used instead of a rank
number to indicate that the word does not occur in the first 100 ranks of
one or other of the profiles.


Word Type           Rank Number
                    Psych.   PGEC

a                   06       03
all                 99       --
an                  16       12
analog              --       42
analysis            42       --
and                 03       04
any                 --       65
are                 08       09
as                  11       20
at                  39       43
author              6?       --
automatic           --       80
be                  17       13
been                --       77
behavior            27       --
between             22       --
binary              --       86
both                --       97
but                 63       --
by                  14       14
can                 100      23
change              92       --
circuit             --       46
circuits            --       34
computer            --       10
computers           --       45
considered          --       85
control             --       56
counseling          87       --
data                80       37
described           --       15
design              --       36
development         38       --
differences         98       --
different           97       --
digital             --       26
discussed           34       21
during              75       --
each                84       54
effect              37       --
effects             57       --
electronic          --       60
elements            --       87
equations           --       76
factors             68       --
findings            95       --
first               74       --
for                 10       08
form                --       92
found               53       --
from                --       32
function            96       68
functions           --       47
general             69       94
given               --       35
group               32       --
groups              76       --
has                 64       58
have                56       69
his                 55       --
human               81       --
in                  04       07
information         --       40
input               --       100
into                91       64
is                  07       05
it                  23       24
its                 49       90
language            --       71
learning            31       --
logic               --       93
logical             --       52
machine             --       30
magnetic            --       33
may                 50       91
means               --       74
memory              --       28
mental              93       --
method              44       22
methods             72       63
more                30       66
network             --       95
new                 90       53
no                  79       --
not                 24       --
number              94       50
of                  02       02
on                  13       16
one                 65       41
only                85       --
operation           --       55
operations          --       96
or                  21       27
other               43       70
out                 --       99
output              --       84
part                73       --
perception          77       --
performance         83       --
personality         52       --
possible            --       98
presented           58       59
problem             --       75
problems            51       88
program             --       51
programming         --       83
psychological       78       --
psychology          59       --
reinforcement       89       --
relationship        70       --
required            --       89
research            47       --
response            54       --
results             35       --
set                 --       72
shown               --       73
social              25       --
solution            --       79
some                45       67
storage             --       57
study               28       --
such                71       48
switching           --       39
system              82       18
systems             --       38
technique           --       81
techniques          --       82
test                40       --
than                29       49
that                12       19
the                 01       01
their               61       --
theory              26       --
these               33       44
this                19       25
time                46       62
to                  05       06
two                 36       61
under               41       --
use                 --       31
used                --       29
using               --       78
various             86       --
visual              67       --
was                 15       --
were                18       --
when                60       --
which               20       11
with                09       17

Unclassified report

DESCRIPTORS: Information Retrieval.
Documentation.

States that there is a style of language
characteristic of different subject
areas which is particularly noticeable in
scientific and technical writing. Also
states that the unique vocabulary of a
subject field and the different kinds of
writers set it apart from others.

Reports that an experiment was devised
to test whether these language differences
could be used for subject discrimination in
addition to identification of unique
vocabulary, particularly to determine
whether or not author variation in style is
sufficiently great to override the
variation from field to field. Also reports
that 50 IRE abstracts in the field of
electronic computers and fifty Psychological
Abstracts were matched against two lists of
words ranked in descending order of frequency
as they occurred within two different sets of
three hundred psychological and computer
abstracts. States that using the first
50 ranks of the two lists of abstracts and
words, 93% of the abstracts were successfully
discriminated and for the first 75 and 100
ranks, the success rates were 96% and 97%,
respectively.